CSHM pipeline, protocol, and documentation: cchsflow v3 variable setup #2

Merged: DougManuel merged 14 commits into main from protocol on Jun 10, 2026
@DougManuel

Summary

Brings the CSHM project from its initial scaffold to a working, shareable state: a {targets} pipeline that loads and harmonizes all 11 CCHS PUMF cycles (2001–2022) through cchsflow v3, a prespecified study protocol, per-stage workflow documentation, and a manuscript scaffold whose numbers draw from pipeline targets.

What's here

Pipeline (stages 1–8 active, 9–10 stubbed)

  • Variable setup from versioned worksheets: cshm-variables.csv (41 variables with roles and PUMF/Master source tags), an in-repo snapshot of cchsflow v3 variable_details.csv, and CSHM extension rows (GEOGPRV, WTS_M for 2019-20/2022)
  • load_study_data() with config-profile data sources, pre-flight cycle-coverage validation, cleaning (age floor, skewness-based truncation), MICE imputation, descriptive tables (ported from DemPoRT), and APC data preparation / model fitting
  • Config profiles: default/draft/dev/prod/statscan (RDC paths gitignored)
  • testthat suite: 43 tests passing

Smoking variables — cchsflow v3 final

Documentation (renders cleanly with quarto render)

  • docs/protocol/ — prespecified protocol and one-page summary
  • docs/workflow/ — one page per pipeline stage, generated from the worksheets
  • Divio-style how-to / explanation / reference sections

Known limitations (documented in the protocol and worksheets)

  • SMKDSTY_original (and downstream cigs_per_day, pack_years_der) unavailable for the 2022 PUMF (SMK_05D moved to Master-only); age_start_smoking unavailable for 2019-20+ PUMF (SMKG040 dropped). Both surface as validation warnings, not errors; Master access at the RDC fills the gaps.
  • The cchsflow package installs from a local v3 checkout until cchsflow#186 ("fix(v3): regenerate NAMESPACE and repair smoking worksheet derivations") merges; the v3 branch currently fails R CMD INSTALL from GitHub.
  • Stages 3 and 5 (cleaning, imputation) are implemented but not yet exercised end-to-end — next phase of work.
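
The "warnings, not errors" behaviour of the pre-flight coverage check can be sketched roughly as follows. This is a minimal illustration, not the validator in the repo: the function name, its arguments, the `age` key, and the cycle labels are all assumptions made for the example.

```r
# Sketch only: validate_cycle_coverage() and its inputs are hypothetical.
# `coverage` maps each variable to the cycles where the active source has it.
validate_cycle_coverage <- function(requested_cycles, coverage) {
  gaps <- list()
  for (variable in names(coverage)) {
    missing_cycles <- setdiff(requested_cycles, coverage[[variable]])
    if (length(missing_cycles) > 0) {
      # Warn and keep going, so the run continues with the gap on record.
      warning(
        sprintf("%s unavailable for: %s", variable,
                paste(missing_cycles, collapse = ", ")),
        call. = FALSE
      )
      gaps[[variable]] <- missing_cycles
    }
  }
  invisible(gaps)
}

# Toy coverage map chosen to mirror the 2022 PUMF gap described above.
coverage <- list(
  SMKDSTY_original = c("2019-20"),
  age = c("2019-20", "2022")
)

# Collect the warnings instead of printing them, to show nothing is fatal.
messages <- character(0)
gaps <- withCallingHandlers(
  validate_cycle_coverage(c("2019-20", "2022"), coverage),
  warning = function(w) {
    messages <<- c(messages, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

Returning the gaps invisibly while warning lets a later stage decide whether a given gap matters, instead of halting the whole run.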

DougManuel and others added 13 commits March 27, 2025 11:31
…v3 smoking variables

Survey config now uses Option A: each variable has pumf/master sub-entries
with var, min, and max fields. New survey_var() and survey_bound() accessors
in R/config-utils.R resolve the active data source automatically.
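
A rough sketch of how such accessors might resolve the active source. The config layout, the function signatures, the source variable names, and every bound except the PUMF age maximum of 85 are assumptions for illustration, not the contents of R/config-utils.R.

```r
# Hypothetical config: each variable has pumf/master sub-entries
# with var, min, and max fields. Values are placeholders.
survey_cfg <- list(
  data_source = "pumf",
  variables = list(
    age = list(
      pumf   = list(var = "AGE_PUMF_VAR",   min = 12, max = 85),
      master = list(var = "AGE_MASTER_VAR", min = 12, max = 100)
    )
  )
)

# Shared lookup: pick the sub-entry for the active data source.
survey_entry <- function(cfg, key) {
  entry <- cfg$variables[[key]]
  if (is.null(entry)) {
    stop("Unknown survey key: ", key, call. = FALSE)
  }
  source_entry <- entry[[cfg$data_source]]
  if (is.null(source_entry)) {
    stop("No '", cfg$data_source, "' entry for key: ", key, call. = FALSE)
  }
  source_entry
}

# Name of the source variable for the active data source.
survey_var <- function(cfg, key) {
  survey_entry(cfg, key)$var
}

# Lower or upper bound for the active data source.
survey_bound <- function(cfg, key, bound = c("min", "max")) {
  bound <- match.arg(bound)
  survey_entry(cfg, key)[[bound]]
}

pumf_age_var <- survey_var(survey_cfg, "age")
pumf_age_max <- survey_bound(survey_cfg, "age", "max")
```

Keeping the source switch in one helper means calling code never branches on PUMF versus Master itself.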

Smoking variables updated to unified cchsflow v3 names:
- SMKDSTY -> SMKDSTY_original (CEP-002 year-based naming)
- SMK_09A_cont/SMK_09C -> time_quit_smoking_daily
- Added age_start_daily, cigs_per_day, pack_years config keys
- Demoted 11 intermediate variables in cshm-variables.csv

APC cessation logic corrected for SMKDSTY_original categories:
scope changed from c(1,2,3,4) to c(1,2,4) — excludes always-occasional
smokers (cat 3) who never smoked daily.
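
The scope change amounts to a membership filter on SMKDSTY_original. A toy sketch, assuming the usual six-category coding; only category 3 (always-occasional) is described in this commit, and the data frame and object names are illustrative:

```r
# Categories with any history of daily smoking, per the corrected scope.
ever_daily_codes <- c(1, 2, 4)

# Toy respondents, one per SMKDSTY_original category (assumed to run 1-6).
respondents <- data.frame(
  id = 1:6,
  SMKDSTY_original = c(1, 2, 3, 4, 5, 6)
)

# %in% treats a missing status as out of scope rather than propagating NA.
in_scope <- respondents[respondents$SMKDSTY_original %in% ever_daily_codes, ]
```

Category 3 is excluded because a respondent who never smoked daily cannot contribute a daily-smoking cessation event.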

Stage 1 workflow QMD now generates two tables: variable definitions and
cycle coverage matrix (PUMF/Master per year).

Sync worksheets/cshm-variables.csv to the merged cchsflow v3 smoking
variables: verbatim v3 databaseStart/variableStart, full transitive
DerivedVar feeder closure (41 rows), corrected source columns, and
SMKDVSTP notes (no longer a v3 feeder). Drop coverage claims for
variables absent from the 2019-20 PUMF (SMKG040 family).

Wire the pipeline to an in-repo snapshot of cchsflow v3
variable_details.csv (worksheets/cchsflow-variable-details.csv) so the
repo runs without a sibling cchsflow checkout. The snapshot carries
local fixes for seven worksheet defects found during validation
(cchsflow #184, #185; cchsflow-data #3). Trim CSHM extension rows to
GEOGPRV and WTS_M now that v3 covers age and sex for 2001-2023.

Attach cchsflow via tar_option_set in _targets.R (v3 derivation
functions need its Depends on the search path), record the local
cchsflow v3 install and here in renv.lock, and update CLAUDE.md and
the variable-setup workflow doc accordingly.

Validated by harmonizing 1% samples of CCHS 2001, 2015-16, and
2019-20: all unified smoking variables derive with plausible
non-missing rates; known gaps are SMKDSTY_original for 2022 PUMF and
age_start_smoking for 2019-20+ PUMF (warnings, not errors).

CLAUDE.md, docs/workflow/1-variable-setup.qmd, and renv.lock also
carry earlier uncommitted edits from the in-progress repo restructure.

StatCan shipped the 2015-16 PUMF with valid skip and not stated pooled
into SMKG035 code 11 ("age 50 or older"; 46.3% weighted per their own
data dictionary, absent from the errata). The uniform midpoint rule
turned this into age 55 at first cigarette for ~44k never-smokers.
The snapshot now maps cchs2015_2016_p code 11 to NA(a):
age_first_cigarette for 2015-16 returns to 58.8% non-missing
(median 16) from a corrupt 100%.
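
The effect of the cycle-specific override can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, the midpoint for code 3 is a placeholder, and plain NA stands in for the tagged NA(a) used by cchsflow.

```r
# Sketch only: recode_age_first_cigarette() is a hypothetical helper.
# `midpoints` is a named numeric vector keyed by category code.
recode_age_first_cigarette <- function(code, cycle, midpoints) {
  # Uniform midpoint rule: look each code up by name.
  age <- unname(midpoints[as.character(code)])
  # 2015-16 PUMF only: code 11 pools valid skip and not stated,
  # so it cannot be read as "age 50 or older".
  pooled <- cycle == "cchs2015_2016_p" & code == 11
  age[which(pooled)] <- NA_real_
  age
}

# Placeholder midpoint for code 3; code 11 -> 55 follows the commit text.
toy_midpoints <- c("3" = 16, "11" = 55)

recoded <- recode_age_first_cigarette(
  code  = c(11, 11, 3),
  cycle = c("cchs2015_2016_p", "cchs2001_p", "cchs2015_2016_p"),
  midpoints = toy_midpoints
)
```

Scoping the override to the one affected cycle leaves code 11 meaning "age 50 or older" everywhere else.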

Forensics: Big-Life-Lab/cchsflow-data#3. Upstream fix in
Big-Life-Lab/cchsflow#186 alongside the cat5/SMKG040 repairs already
captured in the previous snapshot commit.

Retire the interim smoking implementation (R/smoking.R,
R/process_smoking_initiation.R) to R/legacy/ for reference; the
pipeline now derives smoking variables through cchsflow v3. Drop the
stale test for the retired function (its replacement is covered by
test-apc-data.R).

Remove the old config/ YAML and variable CSVs (replaced by config.yml
profiles and the worksheets/ structure) and the superseded
project-specification and protocol stubs (replaced by docs/protocol/).

…idation

Port the descriptive-statistics engine and worksheet helpers from
DemPoRT (get-descriptive-data.R, create-descriptive-tables.R,
variables-sheet-utils.R, variable-details-sheet-utils.R) and add the
CSHM stage functions: clean_study_data(), impute_data() (MICE),
prepare_apc_data()/fit_apc_model(), rate-table and prevalence
validation stubs, and the pre-flight cycle-coverage validator.

load_study_data() gains data_source filtering, raw_data_file_map
support for cchsflow-data release files, and the as.data.frame()
guard for the rec_with_table() tibble bug.

Add testthat coverage for APC data preparation and descriptive
tables, the LinkML role-vocabulary schema (cshm-variables.yaml),
the PUMF object renaming script, and the RDC config template.

…fold

Reorganize the docs site around three purposes: the prespecified
study protocol (docs/protocol/), one workflow page per pipeline stage
(docs/workflow/ stages 1-8), and Divio-style how-to / explanation /
reference sections. Add the manuscript scaffold rendered to Word via
the docstyle extension, with all numbers drawn inline from pipeline
targets.

Update README, CONTRIBUTING, LICENSE, site config, and styles for the
restructure; vendor the docstyle and fontawesome Quarto extensions;
ignore manuscript render output, resources/, and machine-local Claude
settings; commit shared project settings (.claude/settings.json).

Drop the stale docs/_extensions/docstyle.bak copy and the
manuscript/.quarto project cache; ignore the cache going forward.

clean_study_data() now excludes respondents below cfg$age_exclusion_min
using the continuous age variable, replacing the age-group-code
mechanism whose survey key (age_grouped) was removed in the config
restructure. Drop the age_grouped column from the APC test helper and
remove the test for the retired map_variable_data() (now in R/legacy).
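
A minimal sketch of an age floor applied to a continuous age column. The helper name, the column name, and the choice to keep rows with missing age are assumptions; the actual clean_study_data() may treat missing ages differently.

```r
# Sketch only: exclude_below_age_floor() is a hypothetical helper.
exclude_below_age_floor <- function(data, age_var, age_min) {
  age <- data[[age_var]]
  # Assumption: rows with missing age are kept for later stages;
  # only respondents known to be below the floor are dropped.
  keep <- is.na(age) | age >= age_min
  data[keep, , drop = FALSE]
}

# Toy data and a toy config carrying the floor.
cfg <- list(age_exclusion_min = 12)
toy <- data.frame(id = 1:4, age_cont = c(10, 12, NA, 40))

cleaned <- exclude_below_age_floor(toy, "age_cont", cfg$age_exclusion_min)
```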

Full suite: 43 pass, 0 fail.

The manuscript is rendered separately to Word via docstyle and reads
pipeline targets that fresh clones will not have; the site render now
skips it. Full site renders cleanly.
Copilot AI review requested due to automatic review settings June 10, 2026 15:10

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

…ents

From the pre-merge review (four review agents over PR #2):

- clean_study_data() role filter compared whole comma-separated role
  strings against single roles and matched nothing, so the skewness
  check and truncation silently processed zero variables. Now uses
  select_vars_by_role().
- Cycle-1 survey year corrected to 2001 (CCHS 1.1 collected Sept
  2000-Nov 2001); config value and test both said 2002 while the
  inline comment and cycle label said 2001. Shifts cycle-1 cohort
  assignment by one year — flagged for confirmation.
- survey_cycle_code() unknown-name guard was dead code (subscript
  error fired first); now checks names() membership.
- Comment/config accuracy: never-smoker NA(a) vs 50+ midpoint 55 in
  apc-model.R; PUMF initiation floor comments now note the 5-11
  (midpoint 8) category excluded by the floor of 13 (open decision);
  age max 85 not 80; ethnicity mapping SDCGCGT (SDC_RACEM/SDCFRAC do
  not exist); CLAUDE.md default-profile data path, legacy file path,
  draft profile; _targets.R store comments; schema roles.csv pointer;
  smoking-histories.R and validation.R headers; %||% comment.
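
The role-filter bug in the first item above can be reproduced in a few lines. This is a sketch, assuming a worksheet with `variable` and `role` columns and made-up role names; the real select_vars_by_role() may have a different signature.

```r
# Toy worksheet: roles are stored as comma-separated strings.
worksheet <- data.frame(
  variable = c("age", "bmi", "sex"),
  role = c("predictor, truncate", "outcome, truncate", "predictor"),
  stringsAsFactors = FALSE
)

# Buggy pattern: whole-string comparison never equals a single role
# when every relevant row carries more than one role.
buggy_match <- worksheet$variable[worksheet$role == "truncate"]

# Sketch of a role-aware selector: split, trim, then test membership.
select_vars_by_role <- function(variables, role) {
  roles <- strsplit(variables$role, ",", fixed = TRUE)
  has_role <- vapply(
    roles,
    function(r) role %in% trimws(r),
    logical(1)
  )
  variables$variable[has_role]
}

fixed_match <- select_vars_by_role(worksheet, "truncate")
```

Because the buggy comparison returns an empty vector without raising an error, the downstream skewness check simply loops over nothing, which is why the defect stayed silent.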

Tests: 43 pass, 0 fail.
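
For the survey_cycle_code() item above, the dead-guard pattern and a names()-based fix might look like this. The lookup table, its codes, and the function bodies are illustrative assumptions, not the repo's implementation.

```r
# Placeholder lookup: cycle names mapped to made-up integer codes.
cycle_lookup <- c(cchs2001_p = 1L, cchs2015_2016_p = 2L)

# Dead guard: `[[` on a named atomic vector errors for an unknown name,
# so the is.null() check below is never reached.
survey_cycle_code_buggy <- function(name, lookup = cycle_lookup) {
  code <- lookup[[name]]
  if (is.null(code)) {
    stop("Unknown survey cycle: ", name, call. = FALSE)
  }
  code
}

# Fix: test names() membership before subscripting.
survey_cycle_code <- function(name, lookup = cycle_lookup) {
  if (!name %in% names(lookup)) {
    stop("Unknown survey cycle: ", name, call. = FALSE)
  }
  lookup[[name]]
}

# Capture both error messages for comparison.
buggy_message <- tryCatch(
  survey_cycle_code_buggy("not_a_cycle"),
  error = function(e) conditionMessage(e)
)
fixed_message <- tryCatch(
  survey_cycle_code("not_a_cycle"),
  error = function(e) conditionMessage(e)
)
```

With the membership check first, the caller sees the intended message naming the bad cycle instead of a generic subscript error.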